COVID-19 Data Analysis

This project explores COVID-19 confirmed cases, deaths, and recoveries by country. We use descriptive statistics and data visualization to identify trends and patterns.

In [13]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

The dataset contains country-level COVID-19 data for 187 countries and regions.

In [14]:
data = pd.read_csv('OneDrive/NSDCprojects/Capsule_COVID_Data_Visualization/country_wise_latest.csv')
data.head()
Out[14]:
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
0 Afghanistan 36263 1269 25198 9796 106 10 18 3.50 69.49 5.04 35526 737 2.07 Eastern Mediterranean
1 Albania 4880 144 2745 1991 117 6 63 2.95 56.25 5.25 4171 709 17.00 Europe
2 Algeria 27973 1163 18837 7973 616 8 749 4.16 67.34 6.17 23691 4282 18.07 Africa
3 Andorra 907 52 803 52 10 0 0 5.73 88.53 6.48 884 23 2.60 Europe
4 Angola 950 41 242 667 18 1 0 4.32 25.47 16.94 749 201 26.84 Africa

As a first look, we examine whether there is a relationship between the number of deaths and the number of recovered cases.

In [15]:
print(data[['Deaths','Recovered']].corr())
sns.scatterplot(x='Deaths', y='Recovered', data=data)
plt.show()
             Deaths  Recovered
Deaths     1.000000   0.832098
Recovered  0.832098   1.000000
[Figure: scatter plot of Deaths (x-axis) against Recovered (y-axis)]
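
As a rough cross-check of what `.corr()` reports, the Pearson coefficient can be computed directly from its definition. The sketch below uses only the five rows shown in the `data.head()` output, so its value will differ from the full-dataset figure of 0.832; `sample`, `r_manual`, and `r_pandas` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Five rows copied from the data.head() output (not the full dataset)
sample = pd.DataFrame({
    'Deaths': [1269, 144, 1163, 52, 41],
    'Recovered': [25198, 2745, 18837, 803, 242],
})

# Pearson r = sum of cross-deviations / sqrt(product of squared deviations)
dx = sample['Deaths'] - sample['Deaths'].mean()
dy = sample['Recovered'] - sample['Recovered'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (Pearson is the default method)
r_pandas = sample['Deaths'].corr(sample['Recovered'])

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

A high coefficient here mostly reflects that countries with large outbreaks have large counts in every column, so it says little about per-case outcomes.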

Summary statistics for the numeric columns:

In [16]:
data.describe()
Out[16]:
Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase
count 1.870000e+02 187.000000 1.870000e+02 1.870000e+02 187.000000 187.000000 187.000000 187.000000 187.000000 187.00 1.870000e+02 187.000000 187.000000
mean 8.813094e+04 3497.518717 5.063148e+04 3.400194e+04 1222.957219 28.957219 933.812834 3.019519 64.820535 inf 7.868248e+04 9448.459893 13.606203
std 3.833187e+05 14100.002482 1.901882e+05 2.133262e+05 5710.374790 120.037173 4197.719635 3.454302 26.287694 NaN 3.382737e+05 47491.127684 24.509838
min 1.000000e+01 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.00 1.000000e+01 -47.000000 -3.840000
25% 1.114000e+03 18.500000 6.265000e+02 1.415000e+02 4.000000 0.000000 0.000000 0.945000 48.770000 1.45 1.051500e+03 49.000000 2.775000
50% 5.059000e+03 108.000000 2.815000e+03 1.600000e+03 49.000000 1.000000 22.000000 2.150000 71.320000 3.62 5.020000e+03 432.000000 6.890000
75% 4.046050e+04 734.000000 2.260600e+04 9.149000e+03 419.500000 6.000000 221.000000 3.875000 86.885000 6.44 3.708050e+04 3172.000000 16.855000
max 4.290259e+06 148011.000000 1.846641e+06 2.816444e+06 56336.000000 1076.000000 33728.000000 28.560000 100.000000 inf 3.834677e+06 455582.000000 226.320000
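
The `inf` mean and `NaN` standard deviation in the `Deaths / 100 Recovered` column appear because at least one country reports zero recoveries, so the ratio divides by zero. A minimal sketch of one way to handle this, assuming the ratio is recomputed from the raw counts: the three rows are copied from the Americas table later in this notebook, and `ratio` and `clean` are illustrative names.

```python
import numpy as np
import pandas as pd

# Three rows copied from the Americas table; Canada reports 0 recovered
sample = pd.DataFrame({
    'Country/Region': ['Chile', 'Canada', 'Ecuador'],
    'Deaths': [9187, 8944, 5532],
    'Recovered': [319954, 0, 34896],
})

# Dividing by zero yields inf, which then poisons mean() and std()
ratio = sample['Deaths'] / sample['Recovered'] * 100

# Treat the undefined ratios as missing so summary statistics skip them
clean = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio.describe())
print(clean.describe())
```

Whether an undefined ratio should be treated as missing is a judgment call; the alternative is to leave the column out of the summary entirely.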

Next, we aggregate the totals by WHO Region in order to choose a region to focus on in the first part of our data visualization.

In [17]:
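# Sum only the count columns; summing ratio or text columns is not meaningful
who_region_data = data.groupby('WHO Region')[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()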

# Plot the trends for confirmed cases, deaths, and recoveries across WHO regions
fig, axes = plt.subplots(3, 1, figsize=(8,12))

sns.barplot(ax=axes[0], x='WHO Region', y='Confirmed', data=who_region_data)
axes[0].set_title('Confirmed Cases by WHO Region')
axes[0].set_ylabel('Confirmed Count Across WHO Regions')

sns.barplot(ax=axes[1], x='WHO Region', y='Deaths', data=who_region_data)
axes[1].set_title('Deaths by WHO Region')
axes[1].set_ylabel('Death Count Across WHO Regions')

sns.barplot(ax=axes[2], x='WHO Region', y='Recovered', data=who_region_data)
axes[2].set_title('Recoveries by WHO Region')
axes[2].set_ylabel('Recovered Count Across WHO Regions')

plt.tight_layout()
plt.show()
[Figure: three bar charts of Confirmed, Deaths, and Recovered totals by WHO Region]
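
To make the aggregation step concrete, here is a small sketch of the same `groupby` pattern applied to the five rows from the `data.head()` output. It selects the count columns before summing; `sample` and `totals` are illustrative names.

```python
import pandas as pd

# Five rows copied from the data.head() output
sample = pd.DataFrame({
    'Country/Region': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'Confirmed': [36263, 4880, 27973, 907, 950],
    'Deaths': [1269, 144, 1163, 52, 41],
    'WHO Region': ['Eastern Mediterranean', 'Europe', 'Africa', 'Europe', 'Africa'],
})

# One row per region, holding the column-wise totals of its countries
totals = sample.groupby('WHO Region')[['Confirmed', 'Deaths']].sum()
print(totals)
```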

Because the Americas had the most confirmed cases, deaths, and recoveries among all WHO Regions, we will first focus our analysis on cases in the Americas.

Data Analysis of Cases in the Americas

In [18]:
d_america = data[data['WHO Region'] == 'Americas'].sort_values(by='Confirmed', ascending = False)
new_data = d_america.head(10) #top 10 countries
new_data
Out[18]:
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
173 US 4290259 148011 1325804 2816444 56336 1076 27941 3.45 30.90 11.16 3834677 455582 11.88 Americas
23 Brazil 2442375 87618 1846641 508116 23284 614 33728 3.59 75.61 4.74 2118646 323729 15.28 Americas
111 Mexico 395489 44022 303810 47657 4973 342 8588 11.13 76.82 14.49 349396 46093 13.19 Americas
132 Peru 389717 18418 272547 98752 13756 575 4697 4.73 69.93 6.76 357681 32036 8.96 Americas
35 Chile 347923 9187 319954 18782 2133 75 1859 2.64 91.96 2.87 333029 14894 4.47 Americas
37 Colombia 257101 8777 131161 117163 16306 508 11494 3.41 51.02 6.69 204005 53096 26.03 Americas
6 Argentina 167416 3059 72575 91782 4890 120 2057 1.83 43.35 4.21 130774 36642 28.02 Americas
32 Canada 116458 8944 0 107514 682 11 0 7.68 0.00 inf 112925 3533 3.13 Americas
51 Ecuador 81161 5532 34896 40733 467 17 0 6.82 43.00 15.85 74620 6541 8.77 Americas
20 Bolivia 71181 2647 21478 47056 1752 64 309 3.72 30.17 12.32 60991 10190 16.71 Americas
In [19]:
new_data.describe()
Out[19]:
Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase
count 1.000000e+01 10.000000 1.000000e+01 1.000000e+01 10.000000 10.000000 10.000000 10.000000 10.000000 10.0000 1.000000e+01 10.000000 10.000000
mean 8.559080e+05 33621.500000 4.328866e+05 3.893999e+05 12457.900000 340.200000 9067.300000 4.900000 51.276000 inf 7.576744e+05 98233.600000 13.644000
std 1.398190e+06 48126.771919 6.313155e+05 8.643672e+05 17243.211981 350.807893 12164.741273 2.832749 27.626579 NaN 1.242506e+06 157603.809848 8.272305
min 7.118100e+04 2647.000000 0.000000e+00 1.878200e+04 467.000000 11.000000 0.000000 1.830000 0.000000 2.8700 6.099100e+04 3533.000000 3.130000
25% 1.291975e+05 6343.250000 4.431575e+04 4.720625e+04 1847.250000 66.750000 696.500000 3.420000 33.925000 5.2275 1.173872e+05 11366.000000 8.817500
50% 3.025120e+05 9065.500000 2.018540e+05 9.526700e+04 4931.500000 231.000000 3377.000000 3.655000 47.185000 8.9600 2.685170e+05 34339.000000 12.535000
75% 3.940460e+05 37621.000000 3.159180e+05 1.147508e+05 15668.500000 558.250000 10767.500000 6.297500 74.190000 13.9475 3.556098e+05 51345.250000 16.352500
max 4.290259e+06 148011.000000 1.846641e+06 2.816444e+06 56336.000000 1076.000000 33728.000000 11.130000 91.960000 inf 3.834677e+06 455582.000000 28.020000

The first plot is a bar plot of active cases in the ten countries in the Americas with the most confirmed cases.

In [20]:
active = new_data[['Country/Region','Active']].sort_values(by='Active',ascending=False)
sns.barplot(y=active.get('Country/Region'),x=active.get('Active'))
Out[20]:
<Axes: xlabel='Active', ylabel='Country/Region'>
[Figure: horizontal bar chart of Active cases by Country/Region]

Insights:

  • Among the ten countries shown, the US had by far the most active cases: about 5.5 times as many as Brazil, which had the second most (2,816,444 versus 508,116). By this measure, the US carried the heaviest ongoing caseload in the Americas.

The second chart is a grouped bar chart comparing confirmed and recovered COVID cases across these countries.

In [21]:
plt.figure(figsize=(10, 5))
X_axis = np.arange(len(new_data['Confirmed']))
plt.bar(X_axis - 0.2,new_data['Confirmed'] , 0.4, label = 'Confirmed')
plt.bar(X_axis + 0.2, new_data['Recovered'], 0.4, label = 'Recovered')

plt.xticks(X_axis,new_data['Country/Region'] )
plt.xlabel("Country/Region")
plt.ylabel("Number of Cases")
plt.title("Comparing the Confirmed and Recovered Counts in the Americas")

plt.legend()
plt.show()
[Figure: grouped bar chart of Confirmed and Recovered counts by country]

Insights:

  • Among countries in the Americas, the US and Brazil had the most confirmed and recovered patients.
  • While the US had more confirmed cases, Brazil had more recovered patients (1,846,641 versus 1,325,804). This may indicate faster recoveries in Brazil, although differences in how countries report recoveries could also explain the gap.
  • Countries with fewer cases, such as Chile and Mexico, had recovered counts close to their confirmed counts (about 92% and 77%, respectively), meaning most of their recorded cases had already been resolved.

The third is a pie chart describing the recovered cases.

In [22]:
plt.figure(figsize=(8,8))
patches, text, autotexts = plt.pie(new_data['Recovered'], labels = new_data['Country/Region'],autopct="%0.2f%%", pctdistance=0.8)
plt.title("Distribution of Recovered COVID Cases")
plt.axis('equal')
plt.legend(patches,new_data['Country/Region'] )
plt.show()
[Figure: pie chart of Recovered cases by country]

Insights:

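  • Brazil had the largest number of recovered COVID cases, consistent with its large confirmed count and its recovery rate of 75.61 per 100 cases.
  • Canada shows zero recovered cases despite 116,458 confirmed cases. This more plausibly reflects missing recovery data in this dataset than an absence of recoveries, so Canada's slice should not be read as a sign of poor control.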
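
Before reading too much into a zero in the Recovered column, it may help to check how the columns relate to each other. The sketch below, using three rows copied from the top-ten table, checks the identity Active = Confirmed - Deaths - Recovered and flags rows where Recovered is zero; the column names `consistent` and `zero_recovered` are illustrative.

```python
import pandas as pd

# Three rows copied from the Americas top-ten table
sample = pd.DataFrame({
    'Country/Region': ['Chile', 'Canada', 'Ecuador'],
    'Confirmed': [347923, 116458, 81161],
    'Deaths': [9187, 8944, 5532],
    'Recovered': [319954, 0, 34896],
    'Active': [18782, 107514, 40733],
})

# Active is derived from the other three counts in these rows
derived_active = sample['Confirmed'] - sample['Deaths'] - sample['Recovered']
sample['consistent'] = derived_active == sample['Active']

# A zero here means every non-fatal case is still counted as active
sample['zero_recovered'] = (sample['Recovered'] == 0) & (sample['Confirmed'] > 0)

print(sample[['Country/Region', 'consistent', 'zero_recovered']])
```

A flagged row inflates the Active count as well, which matters for the active-case charts that follow.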

Next, we are going to compare Active cases in the Americas using a donut chart.

In [24]:
x = new_data['Active'].to_list()
labels = new_data['Country/Region']
# One distinct shade per wedge, so no color repeats across the ten countries
colors = plt.cm.Blues(np.linspace(0.9, 0.3, len(x)))
explode = [0.05] * len(x)

plt.pie(x, colors=colors, labels=labels,
        autopct='%1.1f%%', pctdistance=0.85,
        explode=explode)

centre_circle = plt.Circle((0, 0), 0.65, fc='white')
fig = plt.gcf()

fig.gca().add_artist(centre_circle)

plt.title('Active Cases in the Americas')
plt.show()
[Figure: donut chart of Active cases by country]

Insights:

  • The US had the highest number of active cases, accounting for over 70% of the active cases among the ten countries shown. This indicates that the US had by far the largest ongoing outbreak in the Americas.
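
The share and ratio behind these active-case comparisons can be reproduced from the Active column of the top-ten table. This is a small arithmetic sketch; `active`, `us_share`, and `us_to_brazil` are illustrative names, and the share is relative to the ten listed countries only.

```python
import pandas as pd

# Active counts copied from the Americas top-ten table
active = pd.Series({
    'US': 2816444, 'Brazil': 508116, 'Mexico': 47657, 'Peru': 98752,
    'Chile': 18782, 'Colombia': 117163, 'Argentina': 91782,
    'Canada': 107514, 'Ecuador': 40733, 'Bolivia': 47056,
})

# Share of the ten-country total held by the US, in percent
us_share = active['US'] / active.sum() * 100

# How many times larger the US count is than Brazil's
us_to_brazil = active['US'] / active['Brazil']

print(f"US share: {us_share:.1f}%, US/Brazil ratio: {us_to_brazil:.2f}")
```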

We will then visualize the number of confirmed cases in the Americas with a choropleth map.

In [25]:
fig = px.choropleth(new_data,
                    locations= 'Country/Region',
                    locationmode='country names',
                    color='Confirmed',
                    color_continuous_scale='Reds',
                    hover_name='Country/Region',
                    title='Total Confirmed Cases in the Americas')

fig.show()

Insights:

  • In total, the North American countries shown had more confirmed COVID cases than the South American ones, driven almost entirely by the US.
  • Although the US had the most confirmed cases, neighboring Canada had far fewer (116,458 versus 4,290,259).

Finally, to gauge the impact of the pandemic, we calculate mortality and recovery rates for each country, average them by WHO region, and plot the regional averages as bar plots.

In [26]:
data['Mortality Rate'] = (data['Deaths'] / data['Confirmed']) * 100
data['Recovery Rate'] = (data['Recovered'] / data['Confirmed']) * 100

# Aggregate data by WHO Region
numeric_columns = ['Confirmed', 'Deaths', 'Recovered', 'Active', 'Mortality Rate', 'Recovery Rate']
who_region_rates = data.groupby('WHO Region')[numeric_columns].mean().reset_index()

fig, axes = plt.subplots(2, 1, figsize=(8, 8))

sns.barplot(ax=axes[0], x='WHO Region', y='Mortality Rate', data=who_region_rates)
axes[0].set_title('Mortality Rate by WHO Region')
axes[0].set_ylabel('Mortality Rate (%)')

sns.barplot(ax=axes[1], x='WHO Region', y='Recovery Rate', data=who_region_rates)
axes[1].set_title('Recovery Rate by WHO Region')
axes[1].set_ylabel('Recovery Rate (%)')

plt.tight_layout()
plt.show()
[Figure: bar charts of average rates by WHO Region]

Insights:

  • Europe had the highest average mortality rate, suggesting that outcomes of the pandemic were most severe there.
  • South-East Asia and the Western Pacific had the lowest average mortality rates, which may reflect more effective control during COVID-19, although differences in testing and reporting could also contribute.
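
Note that `.mean()` on per-country rates weights every country equally, regardless of size. A pooled rate (total deaths divided by total confirmed cases) can give a noticeably different picture. The sketch below contrasts the two using just the US and Mexico rows from the Americas table; it illustrates the distinction and does not recompute the regional figures, and `unweighted` and `pooled` are illustrative names.

```python
import pandas as pd

# Two rows copied from the Americas top-ten table
sample = pd.DataFrame({
    'Country/Region': ['US', 'Mexico'],
    'Confirmed': [4290259, 395489],
    'Deaths': [148011, 44022],
})

# Per-country mortality rate in percent, as in the notebook
sample['Mortality Rate'] = sample['Deaths'] / sample['Confirmed'] * 100

# Unweighted mean: each country counts once
unweighted = sample['Mortality Rate'].mean()

# Pooled rate: each confirmed case counts once
pooled = sample['Deaths'].sum() / sample['Confirmed'].sum() * 100

print(f"unweighted mean: {unweighted:.2f}%, pooled: {pooled:.2f}%")
```

The unweighted mean is pulled toward Mexico's higher rate, while the pooled rate stays close to the much larger US figure.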

Data Analysis for Cases in Southeast Asia

Southeast Asia was chosen as the focus region because it is geographically close to China, where COVID-19 was first reported. Due to this proximity, countries in Southeast Asia were exposed to the virus relatively early, making the region important for understanding how the outbreak spread and how different countries were affected.

In [29]:
asia_data = data[data.get('WHO Region') == 'South-East Asia']
asia_data
Out[29]:
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region Mortality Rate Recovery Rate
13 Bangladesh 226225 2965 125683 97577 2772 37 1801 1.31 55.56 2.36 207453 18772 9.05 South-East Asia 1.310642 55.556636
19 Bhutan 99 0 86 13 4 0 1 0.00 86.87 0.00 90 9 10.00 South-East Asia 0.000000 86.868687
27 Burma 350 6 292 52 0 0 2 1.71 83.43 2.05 341 9 2.64 South-East Asia 1.714286 83.428571
79 India 1480073 33408 951166 495499 44457 637 33598 2.26 64.26 3.51 1155338 324735 28.11 South-East Asia 2.257186 64.264803
80 Indonesia 100303 4838 58173 37292 1525 57 1518 4.82 58.00 8.32 88214 12089 13.70 South-East Asia 4.823385 57.997268
106 Maldives 3369 15 2547 807 67 0 19 0.45 75.60 0.59 2999 370 12.34 South-East Asia 0.445236 75.601069
119 Nepal 18752 48 13754 4950 139 3 626 0.26 73.35 0.35 17844 908 5.09 South-East Asia 0.255973 73.346843
158 Sri Lanka 2805 11 2121 673 23 0 15 0.39 75.61 0.52 2730 75 2.75 South-East Asia 0.392157 75.614973
167 Thailand 3297 58 3111 128 6 0 2 1.76 94.36 1.86 3250 47 1.45 South-East Asia 1.759175 94.358508
168 Timor-Leste 24 0 0 24 0 0 0 0.00 0.00 0.00 24 0 0.00 South-East Asia 0.000000 0.000000

The following scatter plot shows the relationship between the 1 week % increase and the recovery rate. The color of each point encodes the number of confirmed cases; all points are drawn at the same size.

In [30]:
points = plt.scatter(asia_data.get('1 week % increase'), asia_data.get('Recovery Rate'), c = asia_data.get('Confirmed'), cmap = 'cividis')
plt.colorbar(points, label='Confirmed cases')  # legend for the color encoding
plt.xlabel('1 Week % Increase')
plt.ylabel('Recovery Rate')
plt.title('Recovery Rate vs. 1 Week % Increase')

for i in range(len(asia_data)):
    x = asia_data.get('1 week % increase').iloc[i]
    y = asia_data.get('Recovery Rate').iloc[i]
    label = asia_data.get('Country/Region').iloc[i]

    plt.annotate(
        label,
        (x, y),
        textcoords="offset points",
        xytext=(5, 5),
        ha='center',
        fontsize=8
    )

plt.show()
[Figure: scatter plot of Recovery Rate against 1 Week % Increase, with country labels]

Insights:

  • Overall, there is no strong linear relationship between the two variables. Countries with a low weekly increase can have either high or low recovery rates. For example, Thailand, Burma, and Sri Lanka have low 1-week increases (all below 3%) and relatively high recovery rates, suggesting that slower case growth may be associated with better recovery outcomes in some countries.
  • On the other hand, India stands out as an outlier. It has the highest 1-week percentage increase (28.11%), while its recovery rate (about 64%) is only moderate compared to other countries. This may indicate that rapid case growth can put pressure on healthcare systems, slowing recovery.
  • Timor-Leste is another notable outlier, with a 0% weekly increase and a 0% recovery rate. With only 24 confirmed cases, this could reflect limited healthcare capacity or delays in reporting recoveries rather than a meaningful trend.
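
The visual impression can be checked numerically. The sketch below computes the Pearson correlation between the two plotted columns, once for all ten countries and once with Timor-Leste excluded, to show how much a single outlier can move the coefficient. The values are copied from the table above (recovery rates rounded to two decimals); `r_all` and `r_without` are illustrative names.

```python
import pandas as pd

# Values copied from the South-East Asia table
sea = pd.DataFrame({
    'Country/Region': ['Bangladesh', 'Bhutan', 'Burma', 'India', 'Indonesia',
                       'Maldives', 'Nepal', 'Sri Lanka', 'Thailand', 'Timor-Leste'],
    '1 week % increase': [9.05, 10.00, 2.64, 28.11, 13.70,
                          12.34, 5.09, 2.75, 1.45, 0.00],
    'Recovery Rate': [55.56, 86.87, 83.43, 64.26, 58.00,
                      75.60, 73.35, 75.61, 94.36, 0.00],
})

# Correlation across all ten countries
r_all = sea['1 week % increase'].corr(sea['Recovery Rate'])

# Same correlation with the Timor-Leste outlier removed
subset = sea[sea['Country/Region'] != 'Timor-Leste']
r_without = subset['1 week % increase'].corr(subset['Recovery Rate'])

print(f"all countries: {r_all:.2f}, without Timor-Leste: {r_without:.2f}")
```

With only ten data points, neither coefficient is strong evidence; the comparison mainly shows that conclusions here are sensitive to individual countries.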

The second visualization for Southeast Asia is a choropleth map of the mortality rate.

In [31]:
fig = px.choropleth(
    asia_data,
    locations='Country/Region',
    locationmode='country names',
    color='Mortality Rate',
    hover_name='Country/Region',
    color_continuous_scale='Blues',
    title='COVID-19 Mortality Rate in South-East Asia'
)

fig.show()

Insights:

  • On the map, countries in Southeast Asia show noticeable variation in mortality rates. Some countries are shaded darker blue, indicating higher mortality rates, while others remain much lighter. This suggests that outcomes differed significantly even within the same region.
  • Countries closer to China are also shaded lighter, which suggests that geographic proximity alone does not determine the mortality rate. Factors such as healthcare capacity, population density, government response, and reporting practices likely played a role in these differences.

The third visualization for the Southeast Asia region is a pie chart showing the share of active cases in the top five countries, with the remaining countries grouped as "Others".

In [32]:
top5 = asia_data.sort_values('Active', ascending=False).head(5)
others = asia_data['Active'].sum() - top5['Active'].sum()

labels = list(top5['Country/Region']) + ['Others']
sizes = list(top5['Active']) + [others]

plt.figure(figsize=(10, 10))

wedges, texts, autotexts = plt.pie(
    sizes,
    labels=labels,
    autopct='%1.1f%%',
    startangle=140,
    colors = plt.cm.YlOrBr(np.linspace(0.4, 1, len(sizes))),
    wedgeprops={'edgecolor': 'white', 'linewidth': 1},
    pctdistance=0.5
)

for text in texts:
    text.set_fontsize(10)
for text in autotexts:
    text.set_fontsize(10)
    text.set_fontweight('bold')

plt.title('Distribution of Active COVID-19 Cases in South-East Asia (Top 5 Countries)')
plt.axis('equal')

plt.show()
[Figure: pie chart of Active cases for the top five countries plus Others]

Insights:

  • A small number of countries account for a large proportion of active cases, while the remaining countries contribute a much smaller share.
  • Although India had the most active cases according to the pie chart, its mortality rate (2.26%) was not the highest in the region; Indonesia's was (4.82%). This suggests India kept deaths relatively low despite its large caseload.
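
The "top five plus Others" grouping behind the pie chart can be reproduced with a few lines of pandas. This sketch uses `nlargest` as an alternative to sorting and taking the head, with the Active counts copied from the South-East Asia table; `sizes` and `shares` are illustrative names.

```python
import pandas as pd

# Active counts copied from the South-East Asia table
active = pd.Series({
    'Bangladesh': 97577, 'Bhutan': 13, 'Burma': 52, 'India': 495499,
    'Indonesia': 37292, 'Maldives': 807, 'Nepal': 4950,
    'Sri Lanka': 673, 'Thailand': 128, 'Timor-Leste': 24,
})

# Keep the five largest countries and lump the rest into one slice
top5 = active.nlargest(5)
others = active.sum() - top5.sum()
sizes = pd.concat([top5, pd.Series({'Others': others})])

# Percentage share of each slice, matching what autopct displays
shares = sizes / sizes.sum() * 100
print(shares.round(2))
```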